Efficient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks

نویسندگان

چکیده

Lipreading aims to recognize sentences being spoken by a talking face. In recent years, the lipreading method has achieved high level of accuracy on large datasets and made breakthrough progress. However, is still far from solved, existing methods tend have error rates wild data defects disappearing training gradient slow convergence. To overcome these problems, we proposed an efficient end-to-end sentence-level model, using encoder based 3D convolutional network, ResNet50, Temporal Convolutional Network (TCN), CTC objective function as decoder. More importantly, architecture incorporates TCN feature learner decode feature. It can partly eliminate RNN (LSTM, GRU) disappearance insufficient performance, this yields notable performance improvement well faster Experiments show that convergence speed are 50% than state-of-the-art method, improved 2.4% GRID dataset.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

LipNet: End-to-End Sentence-level Lipreading

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end perform only word classification, rat...

متن کامل

End-to-End Multi-View Lipreading

Non-frontal lip views contain useful information which can be used to enhance the performance of frontal view lipreading. However, the vast majority of recent lipreading works, including the deep learning approaches which significantly outperform traditional approaches, have focused on frontal mouth images. As a consequence, research on joint learning of visual features and speech classificatio...

متن کامل

End-to-End Kernel Learning with Supervised Convolutional Kernel Networks

In this paper, we introduce a new image representation based on a multilayer kernelmachine. Unlike traditional kernel methods where data representation is decoupledfrom the prediction task, we learn how to shape the kernel with supervision. Weproceed by first proposing improvements of the recently-introduced convolutionalkernel networks (CKNs) in the context of unsupervised lear...

متن کامل

LCANet: End-to-End Lipreading with Cascaded Attention-CTC

Machine lipreading is a special type of automatic speech recognition (ASR) which transcribes human speech by visually interpreting the movement of related face regions including lips, face, and tongue. Recently, deep neural network based lipreading methods show great potential and have exceeded the accuracy of experienced human lipreaders in some benchmark datasets. However, lipreading is still...

متن کامل

LipNet: Sentence-level Lipreading

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequenc...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: Applied sciences

سال: 2021

ISSN: ['2076-3417']

DOI: https://doi.org/10.3390/app11156975